Yiren Wang(23794201) (50%): Data cleaning and transformation, Single variable classification, clustering, Shiny App
Karen Huan Yi Lee(23715313) (50%): Data cleaning and transformation, multivariate classification, clustering, Shiny App
Here is the link for our YouTube video.
The dataset analyzed in this project is called Global Youtube Statistics 2023, which can be downloaded from the Kaggle platform.
This dataset was collected by a Sri Lankan data scientist named Nidula Elgiriyewithana, who recorded data about many aspects of YouTube by compiling statistics on the platform's largest channels. The dataset includes information such as video views, upload frequency, country of origin, and earnings.
First, set the random seed and clear the working environment.
Then, load the libraries.
Read the csv file into R and assign the dataset to the variable ‘YT’.
# Replace all empty strings with NaN
YT[YT == ""] <- NaN
YT <- YT %>%
mutate(
allunemployment = (Unemployment.rate / 100) * Population,
unemployment_urban = (Unemployment.rate / 100) * Urban_population,
educationall = (Gross.tertiary.education.enrollment.... / 100) * Population,
educationurban = (Gross.tertiary.education.enrollment.... / 100) * Urban_population
)
YT <- subset(YT, select = -c(Youtuber, Longitude, Latitude, Title, country_rank, Abbreviation, channel_type_rank,Gross.tertiary.education.enrollment....,Unemployment.rate,video_views_rank))
YT$created_year <- as.character(YT$created_year)
YT$created_date <- as.character(YT$created_date)
# replace all the "nan" values in categorical variables to "missing"
YT[] <- lapply(YT, function(x) {
if (is.character(x)) {
x[x %in% c("nan", "NaN")] <- "missing"
}
return(x)
})
# delete rows where all numerical variables (all but the first two columns) are missing
YT <- YT[!(rowSums(is.na(YT[, -c(1, 2)])) == ncol(YT) - 2), ]
# Change missing values with the median value in the certain columns
median_re <- function(YT) {
nucol <- setdiff(names(YT), c("created_year", "created_date", "rank"))
for (col in nucol) {
if (is.numeric(YT[[col]])) {
YT[[col]][is.na(YT[[col]])] <- median(YT[[col]], na.rm = TRUE)
}
}
return(YT)
}
YT <- median_re(YT)
cand.columns <- names(YT)[names(YT) != 'subscribers']
str(YT)
## 'data.frame': 995 obs. of 22 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ subscribers : int 245000000 170000000 166000000 162000000 159000000 119000000 112000000 111000000 106000000 98900000 ...
## $ video.views : num 2.28e+11 0.00 2.84e+10 1.64e+11 1.48e+11 ...
## $ category : chr "Music" "Film & Animation" "Entertainment" "Education" ...
## $ uploads : int 20082 1 741 966 116536 0 1111 4716 493 574 ...
## $ Country : chr "India" "United States" "United States" "United States" ...
## $ channel_type : chr "Music" "Games" "Entertainment" "Education" ...
## $ video_views_for_the_last_30_days: num 2.26e+09 1.20e+01 1.35e+09 1.98e+09 1.82e+09 ...
## $ lowest_monthly_earnings : num 564600 0 337000 493800 455900 ...
## $ highest_monthly_earnings : num 9.0e+06 5.0e-02 5.4e+06 7.9e+06 7.3e+06 ...
## $ lowest_yearly_earnings : num 6.8e+06 4.0e-02 4.0e+06 5.9e+06 5.5e+06 ...
## $ highest_yearly_earnings : num 1.08e+08 5.80e-01 6.47e+07 9.48e+07 8.75e+07 ...
## $ subscribers_for_last_30_days : num 2e+06 2e+05 8e+06 1e+06 1e+06 2e+05 2e+05 2e+05 1e+05 6e+05 ...
## $ created_year : chr "2006" "2006" "2012" "2006" ...
## $ created_month : chr "Mar" "Mar" "Feb" "Sep" ...
## $ created_date : chr "13" "5" "20" "1" ...
## $ Population : num 1.37e+09 3.28e+08 3.28e+08 3.28e+08 1.37e+09 ...
## $ Urban_population : num 4.71e+08 2.71e+08 2.71e+08 2.71e+08 4.71e+08 ...
## $ allunemployment : num 73239992 48251210 48251210 48251210 73239992 ...
## $ unemployment_urban : num 25247290 39787465 39787465 39787465 25247290 ...
## $ educationall : num 3.84e+08 2.90e+08 2.90e+08 2.90e+08 3.84e+08 ...
## $ educationurban : num 1.32e+08 2.39e+08 2.39e+08 2.39e+08 1.32e+08 ...
Firstly, missing values are replaced with ‘NaN’ and irrelevant
columns are removed. Categorical variables containing the character
‘nan’ are set to the default value ‘missing’. Feature engineering
constructs new variables from existing columns to capture interactions
such as the unemployment rate multiplied by population metrics and the
tertiary education enrollment rate multiplied by population. Rows whose
numerical variables are entirely missing are eliminated.
Data integrity is then maintained by using medians to impute missing
numerical values. Data cleaning is vital for maintaining the dataset’s
quality and reliability for analytical purposes and for preparing it
for the later machine learning modelling.
In the analysis, the target (response) variable selected is a newly
created column subscribersup, derived from the number
of YouTube channel subscribers. This variable has been carefully crafted
to create a distinct difference between channels with a high and a
low subscriber base. The dataset is separated into two groups at the
median value: channels with subscribers below
17,700,000, and channels with subscribers at or above this level.
Channels with a lower subscriber count are
labeled [0,1.77e+07), while channels with a greater subscriber count
are labeled [1.77e+07,Inf]. As the focal point of the analysis, the
target variable is designed to offer practical insights into audience
engagement, content strategy, and advertising prospects for enhancing
YouTube channel performance and impact.
##
## [0,1.77e+07) [1.77e+07,Inf]
## 490 505
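The chunk that constructs subscribersup is not shown in this extract. A minimal sketch, on toy data, of how such a median split could be produced with `cut()` (hypothetical values; the generated interval labels mirror those reported above):

```r
# Hedged sketch (toy data): build a binary target by splitting subscriber
# counts at the median, as described for 'subscribersup' above.
subs <- c(3e6, 5e6, 1.2e7, 1.77e7, 9.8e7, 2.45e8)   # toy subscriber counts
med  <- median(subs)                                 # split point
subscribersup <- cut(subs, breaks = c(0, med, Inf),
                     right = FALSE)                  # left-closed: [0, med)
table(subscribersup)
```

With `right = FALSE`, the lower interval is left-closed, matching the [0,1.77e+07) / [1.77e+07,Inf] labelling used in the report.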
By examining the proportions of the selected outcome variable, it can be observed that the two outcomes are distributed almost equally. Building reliable and unbiased machine learning models requires balanced representation of both classes, low-subscriber and high-subscriber, which this near-equal distribution provides. Class balance is essential in classification tasks since it prevents the model from being biased towards predicting the majority class, ensuring fair evaluation and trustworthy insights into the factors influencing high subscriber counts.
In order to improve the effectiveness and accuracy of machine learning models, feature selection is a vital step. To reduce complexity and improve model performance, it involves selecting relevant features from a dataset (Khalid, Khalil, & Nasreen, 2014). The correlation coefficient stands out among the numerous techniques as a useful statistical measure, illuminating the strength and direction of correlations between variables. This coefficient shows to be an essential tool in the field of machine learning, assisting in the discovery and choice of the features that have the greatest impact on the target variable (Hsu & Hsieh, 2010).
Benefits of Correlation Coefficient Feature Selection:
The correlation coefficient is employed for feature selection in the data modelling that follows due to its proven efficiency. Its advantages, such as improved model performance, increased computational efficiency, and better interpretability, make it clear why this method is widely used in data science and machine learning. The correlation coefficient helps ensure that machine learning models are accurate, efficient, and insightful, optimizing their influence across a variety of domains.
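As a toy illustration of the idea (not the project data): the correlation between each numeric feature and a 0/1 target can be computed with `cor()`, and features whose absolute correlation falls below a chosen threshold dropped.

```r
# Toy data: one informative feature, one noise feature, binary target
set.seed(1)
y  <- rep(c(0, 1), each = 50)
x1 <- y + rnorm(100, sd = 0.5)       # correlated with the target
x2 <- rnorm(100)                     # pure noise
feats <- data.frame(x1 = x1, x2 = x2)

# Keep features whose absolute correlation with y meets the threshold
threshold <- 0.05
keep <- names(feats)[abs(sapply(feats, cor, y = y)) >= threshold]
keep
```

The same pattern, with threshold 0.05, underlies the `filter(Freq >= 0.05)` step applied to the correlation table later in the report.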
# Identify categorical and numerical variables
YTclass <- setdiff(colnames(YT), c(target,'subscribersup'))
YTCV <- YT %>%
select(all_of(YTclass)) %>%
select_if(~ is.factor(.) || is.character(.)) %>%
names()
YTNV <- YT %>%
select(all_of(YTclass)) %>%
select_if(~ is.numeric(.)) %>%
names()
YTNV_cluster <- YT[YTNV]
cat("There are ",length(YTCV)," categorical variables and ",length(YTNV), "numerical variables.")
## There are 6 categorical variables and 15 numerical variables.
The dataset YT is divided into two categories: categorical and numerical variables. Categorical variables are identified by their factor or character data types, whereas numerical variables are recognized by their numeric data type.
Categorical variables are converted to numerical count variables for the single-variable models, while the original variables are kept for the later multivariate models.
for (cV in YTCV) {
catcount <- table(YT[, cV])
countcv <- paste("Numerical", cV, sep = '_')
YT[countcv] <- catcount[match(YT[, cV], names(catcount))]
}
YTnewclass <- setdiff(colnames(YT), target)
YTCV <- YT %>%
select(all_of(YTnewclass)) %>%
select_if(~ is.factor(.) || is.character(.)) %>%
names()
YTNV <- YT %>%
select(all_of(YTnewclass)) %>%
select_if(~ is.numeric(.)) %>%
names()
YTNVdata <- YT[sapply(YT, is.numeric) | sapply(YT, is.integer)]
YTCVdata <- YT[sapply(YT, is.factor) | sapply(YT, is.character)]
The YT dataset is subjected to a transformation method. The approach
initially counts the instances of distinct values for each categorical
variable in YTCV and generates a new numerical variable to
reflect these counts. In order to enable numerical analysis and
modelling, the transformation effectively converts categorical variables
into numerical variables.
A 60/20/20 split is used to separate the YouTube dataset into training, calibration, and testing sets.
YTsaperate <- runif(nrow(YT))
YTtrain <- YT[YTsaperate <= 0.6, ]
YTCal <- YT[YTsaperate > 0.6 & YTsaperate <= 0.8, ]
YTtest <- YT[YTsaperate > 0.8, ]
cat("The training set has", nrow(YTtrain), "observations, the calibration set has", nrow(YTCal)," observations,and the test set has", nrow(YTtest), "observations.")
## The training set has 613 observations, the calibration set has 179 observations,and the test set has 203 observations.
The ‘runif’ function performs the split at random, giving each row an equal chance of being assigned to each set.
A Null Model is created: a simple baseline that predicts the positive class at its base rate, with no predictive power beyond it.
YTAUC <- function(predcol, outcol, pos=pos.label) {
perf <- performance(prediction(predcol, outcol==pos),'auc')
as.numeric(perf@y.values)
}
YTlog <- function(outCol, predCol, pos=pos.label) {
sum(ifelse(outCol==pos, log(predCol), log(1-predCol)))
}
baseRateCheck <- YTlog(YTCal[,target], sum(YTCal[,target]==pos.label)/length(YTCal[,target]) )

| Name | Type | AUC |
|---|---|---|
| Null Model | univariate | 0.5 |
The dataset is fairly balanced, therefore the model predicts
High.Subscribers (TRUE) with a 50% probability, giving it
an Area Under the Curve (AUC) of 0.5. The Null Model
serves as a fundamental benchmark, and its performance is assessed using
the AUC metric. Any advanced model created afterwards should outperform
the Null Model, indicating that it captures significant patterns in the
dataset rather than making predictions based on chance or the base rate.
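A quick sanity check of the AUC = 0.5 claim, on toy data rather than the report's, using a self-contained rank-based (Wilcoxon) AUC rather than the ROCR helper above:

```r
# Toy outcome and a constant base-rate prediction
out  <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
pred <- rep(mean(out), length(out))   # every case gets the base rate

# Rank-based AUC (Wilcoxon form); any constant predictor scores 0.5
auc <- function(p, y) {
  r <- rank(p)                        # ties share the average rank
  (sum(r[y]) - sum(y) * (sum(y) + 1) / 2) / (sum(y) * sum(!y))
}
auc(pred, out)
```

With a constant prediction every observation ties at the average rank, so the numerator reduces to exactly half of the positive-negative pair count.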
Create a function that will be used for model building:
mkPredC <- function(outCol,varCol,appCol, pos=pos.label) {
pPos <- sum(outCol==pos)/length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol),varCol)
pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
mkPredN <- function(outCol,varCol,appCol) {
cuts <- unique(as.numeric(quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T)))
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
mkPredC(outCol, varC, appC)
}
AUC scores are computed for each variable on both the training and calibration sets. Variables having AUC values over 0.5, showing predictive power superior to chance, are chosen for further analysis.
The AUC results of the single-variable models are sorted and shown below.
AUCYTNV <- data.frame(Pred = character(), Type = character(), TrainAUC = numeric(), CalAUC = numeric())
for (v in YTNV) {
predii <- paste('pred', v, sep = '.')
YTtrain[, predii] <- mkPredN(YTtrain[, target], YTtrain[, v], YTtrain[, v])
YTtest[, predii] <- mkPredN(YTtrain[, target], YTtrain[, v], YTtest[, v])
YTCal[, predii] <- mkPredN(YTtrain[, target], YTtrain[, v], YTCal[, v])
trainsetAUC <- YTAUC(YTtrain[, predii], YTtrain[, target])
if (trainsetAUC >= 0.5) {
CalsetAUC <- YTAUC(YTCal[, predii], YTCal[, target])
AUCYTNV <- rbind(AUCYTNV, data.frame(Predictor = predii, Type = 'univariate', TrainAUC = trainsetAUC, CalAUC = CalsetAUC))
}
}
AUCYTNV <- AUCYTNV %>%
filter(Predictor != "pred.subscribers") %>%
mutate(Differ = abs(TrainAUC - CalAUC)) %>%
arrange(Differ) %>%
select(-Differ)
kable(AUCYTNV, digits = 4, align = 'lccc')

| Predictor | Type | TrainAUC | CalAUC |
|---|---|---|---|
| pred.rank | univariate | 0.9991 | 0.9987 |
| pred.Numerical_Country | univariate | 0.5576 | 0.5469 |
| pred.unemployment_urban | univariate | 0.5351 | 0.5517 |
| pred.allunemployment | univariate | 0.5478 | 0.5288 |
| pred.Population | univariate | 0.5448 | 0.5250 |
| pred.educationurban | univariate | 0.5478 | 0.5238 |
| pred.Numerical_created_month | univariate | 0.5519 | 0.5277 |
| pred.Urban_population | univariate | 0.5472 | 0.5225 |
| pred.educationall | univariate | 0.5575 | 0.5212 |
| pred.uploads | univariate | 0.5502 | 0.5097 |
| pred.Numerical_created_date | univariate | 0.5314 | 0.4822 |
| pred.Numerical_channel_type | univariate | 0.5720 | 0.5182 |
| pred.Numerical_category | univariate | 0.5871 | 0.5283 |
| pred.video.views | univariate | 0.7976 | 0.7272 |
| pred.subscribers_for_last_30_days | univariate | 0.6238 | 0.5436 |
| pred.lowest_monthly_earnings | univariate | 0.6920 | 0.5984 |
| pred.Numerical_created_year | univariate | 0.5514 | 0.4573 |
| pred.highest_monthly_earnings | univariate | 0.6899 | 0.5898 |
| pred.lowest_yearly_earnings | univariate | 0.6921 | 0.5840 |
| pred.highest_yearly_earnings | univariate | 0.6926 | 0.5780 |
| pred.video_views_for_the_last_30_days | univariate | 0.6935 | 0.5748 |
Based on the double density plot, the absence of overlap between the
two categories suggests that pred.rank effectively
distinguishes between low-subscriber and high-subscriber channels.
Log likelihood is computed for each single variable on the calibration set in order to evaluate how well the model fits. It is essential that the chosen variables improve the prediction in order to make the best choices feasible. A variable is retained if its log likelihood exceeds the base-rate check value.
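As a small worked example of the criterion (toy numbers, not the report's): the Bernoulli log likelihood sums log(p) over positives and log(1 − p) over negatives, and a candidate variable is kept when this sum beats the base-rate value.

```r
# Toy worked example of the Bernoulli log-likelihood criterion
out  <- c(TRUE, TRUE, FALSE, TRUE)    # outcomes
pred <- c(0.9, 0.8, 0.3, 0.6)         # a candidate variable's predictions
base <- mean(out)                     # base rate = 0.75

loglik <- function(p, y) sum(ifelse(y, log(p), log(1 - p)))
loglik(pred, out)                     # candidate's log likelihood
loglik(rep(base, length(out)), out)   # base-rate log likelihood
# keep the candidate when its log likelihood exceeds the base-rate value
```

Here the candidate scores about −1.20 against a base-rate score of about −2.25, so it would be selected, exactly the comparison the loop below makes against `baseRateCheck`.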
selPredVars <- character(0)
selVars <- character(0)
minStep <- baseRateCheck
YTlogNV <- data.frame(Pred = character(0), 'Log likelihood' = numeric(0))
for (v in YTNV) {
predii <- paste('pred', v, sep = '.')
likehoodlog <- sum(ifelse(YTCal[, target] == pos.label, log(YTCal[, predii]), log(1 - YTCal[, predii])))
if (likehoodlog > minStep) {
selPredVars <- c(selPredVars, predii)
selVars <- c(selVars, v)
YTlogNV <- rbind(YTlogNV, data.frame(Pred = predii, loglike = likehoodlog))
}
}
kable(YTlogNV, digits = 4, align = 'lc')

| Pred | loglike |
|---|---|
| pred.rank | -4.0726 |
| pred.Urban_population | -123.8131 |
| pred.allunemployment | -123.7484 |
| pred.unemployment_urban | -123.4244 |
| pred.Numerical_Country | -123.2850 |
cormap <- YTNVdata %>%
cbind(subscriberupbool = YT$subscribersupbool) %>%
cor() %>%
as.table() %>%
as.data.frame() %>%
filter(Var1 == "subscriberupbool") %>%
filter(Var2 != "subscriberupbool") %>%
filter(Var2 != "subscribers") %>%
distinct(Var2, .keep_all = TRUE) %>%
filter(Freq >= 0.05)
select_variable_corelation <- as.character(cormap$Var2)
select_variable_ODDcorelation <- as.character(cormap$Var2)[seq(1, length(as.character(cormap$Var2)), by = 2)]
select_variable_EVENcorelation <- as.character(cormap$Var2)[seq(2, length(as.character(cormap$Var2)), by = 2)]
select_variable_log <- as.character(YTlogNV$Pred)
kable(cormap)

| Var1 | Var2 | Freq |
|---|---|---|
| subscriberupbool | video.views | 0.3361460 |
| subscriberupbool | video_views_for_the_last_30_days | 0.1812781 |
| subscriberupbool | lowest_monthly_earnings | 0.2216069 |
| subscriberupbool | highest_monthly_earnings | 0.2213575 |
| subscriberupbool | lowest_yearly_earnings | 0.2218340 |
| subscriberupbool | highest_yearly_earnings | 0.2216185 |
| subscriberupbool | subscribers_for_last_30_days | 0.1670216 |
| subscriberupbool | unemployment_urban | 0.0572518 |
| subscriberupbool | educationurban | 0.0596882 |
| subscriberupbool | Numerical_Country | 0.0601648 |
| subscriberupbool | Numerical_created_month | 0.0605410 |
Another technique employed is examining the correlation between the predictor variables and the target variable. A significant relationship is indicated by a correlation coefficient above a certain threshold, 0.05 in this case. This technique is further refined by separating the selected variables into two groups, odd-indexed and even-indexed, for a better understanding of the relationships between the predictor variables and the target variable. As a result, a total of eleven variables are chosen as feature variables, as listed above.
The function below is used to evaluate the various models that will be built. Its output includes an ROC plot, confusion matrix, log likelihood, prediction distribution plot, and precision and recall plot.
Multimodellikelog <- data.frame(name = character(), 'model Log likelihood' = numeric(), 'Null log likelihood' = numeric(0))
Multimodelauc <- data.frame(Pred = character(), Type = character(), TrainAuc = numeric(), CalAUC = numeric())
YTmodeleval <- function(namea, predmodel, trueV, do.print = TRUE, threshold = 0.5, pos = pos.label, selection = "all"){
evalmodelauc <- YTAUC(predmodel, trueV)
if (do.print == TRUE) {
if (selection %in% c("all", "conma")) {
YTCMtable <- table(pred = predmodel > threshold, true.values = trueV)
TP <- YTCMtable[2, 2]   # predicted TRUE, actually positive
FP <- YTCMtable[2, 1]   # predicted TRUE, actually negative
FN <- YTCMtable[1, 2]   # predicted FALSE, actually positive
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
cat("Precision:", precision, "\nRecall:", recall, "\nConfusion matrix for", namea, "\n")
print(kable(YTCMtable))
}
if (selection %in% c("all", "loglike")){
Nullpre <- mean(trueV == pos.label)
YTmultilog <- YTlog(trueV, predmodel)
nullmultilog <- YTlog(trueV, Nullpre)
rowlog<- data.frame(name = namea, 'model Log likelihood' = YTmultilog, 'Null log likelihood' = nullmultilog)
Multimodellikelog <<- rbind(Multimodellikelog, rowlog)
cat("Model Log Likelihood is", YTmultilog, "and null Log Likelihood is", nullmultilog, ".\n")
}
if (selection %in% c("all", "roc")) {
perf <- performance(prediction(predmodel, trueV), "tpr", "fpr")
YTROCMulti <- data.frame(
FPR = unlist(perf@x.values),
TPR = unlist(perf@y.values),
threshold = unlist(perf@alpha.values)
)
print(ggplot(YTROCMulti, aes(x = FPR, y = TPR)) +
geom_line(color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = "red") +
labs(x = "FPR", y = "TPR", title = paste("ROC Curve for", namea)) +
theme_minimal())
}
if (selection %in% c("all", "Predpplot")) {
PredPplot <- data.frame(predi = predmodel, truevalue = trueV)
plot <- ggplot(PredPplot, aes(x = predi, color = truevalue)) +
geom_density() +
labs(x = "Predicted probability", title = paste("Prediction distribution plot for", namea)) +
theme_minimal()
print(plot)
}
if (selection == "all" || selection == "Prereplot") {
YTprec <- performance(prediction(predmodel, trueV), measure="prec")
YTrec <- performance(prediction(predmodel, trueV), measure="rec")
precision <- (YTprec@y.values)[[1]]
YTxprec <- (YTprec@x.values)[[1]]
recall <- (YTrec@y.values)[[1]]
rocFrame <- data.frame(threshold=YTxprec, precision=precision, recall=recall)
print(ggplot(rocFrame, aes(x=threshold)) +
geom_line(aes(y=precision, color="precision")) +
geom_line(aes(y=recall, color="recall")) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(x = "threshold", y = "percentage", title = paste("precision and recall plot for", namea)) +
theme_minimal())
}
}
evalmodelauc
}
YTevalmodelauc <- function(namea, type, name, predmodelfortrain, truefortrain, predmodelforcal, trueforcal, plotname){
mtrainauc <- YTmodeleval(namea, predmodelfortrain, truefortrain, do.print = FALSE, selection = plotname)
mcalauc <- YTmodeleval(namea, predmodelforcal, trueforcal, do.print = TRUE, selection = plotname)
mresultauc <<- data.frame(Pred = namea, Type = type, TrainAuc = mtrainauc, CalAUC = mcalauc)
Multimodelauc <<- unique(rbind(Multimodelauc,mresultauc))
}
Logistic regression is applied when predicting the likelihood that an instance belongs to a certain class in a binary classification task. It works well when the dependent variable is categorical with two possible outcomes. The binomial distribution describes the probability of the outcomes, and the parameters are optimized to fit the data. Logistic regression efficiently models the relationship between the independent variables and the binary target. Below is the logistic regression with the selected variables:
First, we use the even-indexed data of the correlation:
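The chunk that fits YTlogicregressionEVENCC is not shown in this extract. A minimal sketch, assuming it is a binomial `glm` whose formula is assembled from a vector of selected feature names (toy data and stand-in column names below, not the report's objects):

```r
# Hedged sketch of how a model like YTlogicregressionEVENCC could be fit:
# a binomial logistic regression on a vector of selected feature names.
set.seed(42)
toy <- data.frame(views = rnorm(200), earnings = rnorm(200))
toy$high <- factor(ifelse(toy$views + rnorm(200, sd = 0.5) > 0,
                          "[1.77e+07,Inf]", "[0,1.77e+07)"))
selected <- c("views", "earnings")              # e.g. even-indexed picks
f <- reformulate(selected, response = "high")   # high ~ views + earnings
fit <- glm(f, data = toy, family = binomial)
head(predict(fit, newdata = toy, type = "response"))
```

With a two-level factor response, `glm` models the probability of the second level, and `predict(..., type = 'response')` returns probabilities, matching how the fitted model is used in the evaluation calls below.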
YTevalmodelauc(namea = 'logic regression for even cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionEVENCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "conma")
## Precision: 0.3510638
## Recall: 0.6470588
## Confusion matrix for logic regression for even cc
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 67| 61|
## |TRUE | 18| 33|
The confusion matrix demonstrates how well the model categorizes channels into the specified categories. According to the precision of 0.3511, 35.11% of the channels predicted as high-subscriber are correctly identified. Recall is 0.6471, meaning that the model properly recognizes 64.71% of the actual high-subscriber channels.
YTevalmodelauc(namea = 'logic regression for even cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionEVENCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "loglike")
## Model Log Likelihood is -125.8907 and null Log Likelihood is -123.847 .
The log likelihood for the Logistic Regression model with even-indexed variables is -125.8907, which is less than the Null Model. It shows that when compared to the baseline, the logistic regression model with even-indexed variables does not offer a superior fit to the data.
YTevalmodelauc(namea = 'logic regression for even cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionEVENCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "roc")
The model outperforms the null model if its curve lies above the dashed red line. The ROC curve above fluctuates above and below the red line, which is not desired.
YTevalmodelauc(namea = 'logic regression for even cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionEVENCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Predpplot")
The distribution of predicted probabilities for each class is shown in the plot above, helping to visualize the model’s prediction confidence.
YTevalmodelauc(namea = 'logic regression for even cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionEVENCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Prereplot")
The precision and recall values at various thresholds are displayed in the plot above. Recall measures the model’s capacity to find all relevant instances, while precision measures the accuracy of positive predictions. The plot offers insights into the trade-off between precision and recall.
The overall performance of the logistic regression model with even-indexed variables is poor, so other classifiers will be tried to find a more suitable one. Other combinations of the variables will also be checked.
A logistic regression model, YTlogicregressionlog, is
created by using the log-likelihood variables.
YTevalmodelauc(namea = 'logic regression for log', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionlog, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionlog, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "conma")
## Precision: 0.9893617
## Recall: 1
## Confusion matrix for logic regression for log
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 85| 1|
## |TRUE | 0| 93|
The Logistic Regression with log likelihood variables
has a high precision of 98.94%, demonstrating the model’s accuracy in
identifying positive cases. Its recall of 100%
suggests that the model has overfitted or memorized the
training data.
When a model learns the precise patterns in the training data too well, it is said to have overfitted and fails to be able to generalize to new data as the model memorizes the training set instead of discovering the underlying patterns of the data.
Therefore, high recall is desirable but must be balanced with the model’s capacity to generalize to new data, especially in situations where collecting all positive cases is essential.
YTevalmodelauc(namea = 'logic regression for log', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionlog, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionlog, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "loglike")
## Model Log Likelihood is -36.83781 and null Log Likelihood is -123.847 .
The logistic regression model with log likelihood variables has a
log likelihood of -36.8378, demonstrating a more favorable fit than the
null model’s log likelihood of -123.847.
YTevalmodelauc(namea = 'logic regression for log', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionlog, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionlog, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "roc")
The ROC curve indicates that the model has no discriminative power and is totally random in its ability to identify positive and negative cases.
This is undesirable in most classification tasks since it means the model is not picking up any meaningful patterns in the data and therefore cannot make useful predictions.
YTevalmodelauc(namea = 'logic regression for log', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionlog, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionlog, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Predpplot")
YTevalmodelauc(namea = 'logic regression for log', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionlog, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionlog, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Prereplot")
In this section, logistic regression is performed using the odd-indexed variables selected by their correlation coefficients with the target variable.
YTevalmodelauc(namea = 'logic regression for Odd cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionODDCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "conma")
## Precision: 0.5744681
## Recall: 0.7714286
## Confusion matrix for logic regression for Odd cc
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 69| 40|
## |TRUE | 16| 54|
With a precision of 0.574, the
Logistic Regression Using Correlation Coefficient of Odd Variables
model correctly forecasts a channel as having “High Subscribers” 57.4%
of the time. Recall shows that the model properly predicts 77.1% of all
actual “High Subscribers” channels.
YTevalmodelauc(namea = 'logic regression for Odd cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionODDCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "loglike")
## Model Log Likelihood is -108.2383 and null Log Likelihood is -123.847 .
The model’s log likelihood is greater than that of the null model, which suggests that the model performs better than null model.
YTevalmodelauc(namea = 'logic regression for Odd cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionODDCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "roc")
The ROC curve for this model exhibits a convex form above
the null model’s line, indicating that the model outperforms both the
null model and the Logistic Regression with log likelihood variables.
YTevalmodelauc(namea = 'logic regression for Odd cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionODDCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Predpplot")
YTevalmodelauc(namea = 'logic regression for Odd cc', type = 'multivariate', name = 'logic regression',
               predmodelfortrain = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'),
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTlogicregressionODDCC, newdata = YTCal, type = 'response'),
               trueforcal = YTCal[, target], plotname = "Prereplot")
The precision for the model is relatively low at low thresholds. Precision improves as the threshold rises, indicating fewer false positives. Conversely, recall is high at lower thresholds but declines as the threshold rises. This indicates that at lower thresholds the model captures the majority of true “High Subscribers”, but at higher thresholds it misses some.
Overall, based on the correlation coefficients, the logistic regression model with odd-indexed variables performs well compared with the other models.
A decision tree model is built using the even-indexed variables selected by correlation coefficient.
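The chunk that fits YTEVENCCDT is likewise not shown. A minimal sketch, assuming an `rpart` classification tree (toy data and stand-in names below): for class trees, `predict()` returns one probability column per class, which is why the evaluation calls index the result with `[, pos.label]`.

```r
library(rpart)

# Hedged sketch of a classification tree like YTEVENCCDT on toy data;
# predict() on a class tree yields a probability column per class level.
set.seed(7)
toy <- data.frame(views = rnorm(300), uploads = rnorm(300))
toy$high <- factor(ifelse(toy$views > 0, "[1.77e+07,Inf]", "[0,1.77e+07)"))
tree  <- rpart(high ~ views + uploads, data = toy, method = "class")
probs <- predict(tree, newdata = toy)        # matrix: one column per class
head(probs[, "[1.77e+07,Inf]"])
```

Selecting the positive-class column turns the tree's output into a probability vector comparable with the logistic regression predictions above.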
YTevalmodelauc(namea = 'Decision Tree for even cc', type = 'multivariate', name = 'Decision Tree',
               predmodelfortrain = predict(YTEVENCCDT, newdata = YTtrain)[, pos.label],
               truefortrain = YTtrain[, target],
               predmodelforcal = predict(YTEVENCCDT, newdata = YTCal)[, pos.label],
               trueforcal = YTCal[, target], plotname = "conma")
## Precision: 0.5744681
## Recall: 0.6352941
## Confusion matrix for Decision Tree for even cc
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 54| 40|
## |TRUE | 31| 54|
The Decision Tree for Even Correlation Coefficients has a precision of 0.5745, meaning that 57.45% of the channels predicted to have high subscriber counts are indeed high subscriber channels. Recall is 0.6353, indicating that the model successfully identifies 63.53% of the actual high subscriber channels.
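The reported figures can be checked by hand from the confusion-matrix counts. Below is a minimal sketch, written in Python for illustration only (the report itself works in R); TP = 54, FP = 40, and FN = 31 are the counts consistent with the reported metrics.

```python
# Illustrative only: TP, FP, FN are read off the confusion matrix above
# (TP = 54 correctly flagged high-subscriber channels, FP = 40, FN = 31).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # share of flagged channels that truly have high subscribers
    recall = tp / (tp + fn)     # share of true high-subscriber channels that were flagged
    return precision, recall

p, r = precision_recall(54, 40, 31)
print(round(p, 4), round(r, 4))  # 0.5745 0.6353
```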
YTevalmodelauc(namea = 'Decision Tree for even cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTEVENCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTEVENCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "loglike")

## Model Log Likelihood is -123.9011 and null Log Likelihood is -123.847 .
The log likelihood for the Decision Tree for Even Correlation Coefficients is -123.9011, marginally lower than the null model's -123.847. The model therefore offers essentially no improvement in fit over the null model.
YTevalmodelauc(namea = 'Decision Tree for even cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTEVENCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTEVENCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "roc")

Given that the ROC Curve lies above the null model line, the Decision Tree classifier for the even correlation coefficient dataset exhibits some predictive power.
YTevalmodelauc(namea = 'Decision Tree for even cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTEVENCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTEVENCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Predpplot")

The decision tree for the even correlation coefficients shows significant overlap between the predicted probability distributions of the two classes. This overlap suggests that, in some circumstances, the model may have trouble distinguishing between the two classes.
YTevalmodelauc(namea = 'Decision Tree for even cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTEVENCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTEVENCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Prereplot")

As the threshold rises, recall decreases almost linearly. This behavior is expected: a rising threshold makes the classifier more conservative, so it predicts fewer positives, lowering recall. The precision, indicated by the red curve, starts low, peaks at a certain threshold, and then begins to decline. The precision curve is less stable than the recall curve.
A decision tree model is created using features chosen based on the log likelihood variables.
YTevalmodelauc(namea = 'Decision Tree for log', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTlogDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTlogDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "conma")

## Precision: 0.9893617
## Recall: 1
## Confusion matrix for Decision Tree for log
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 85| 1|
## |TRUE | 0| 93|
A precision of 0.9894 shows that 98.94% of the predicted positive instances are actually positive. A recall of 1 means that 100% of all actual positives are identified by the model. However, as stated before, this suggests overfitting: the model may be memorizing the data rather than learning its patterns, which is not desirable.
YTevalmodelauc(namea = 'Decision Tree for log', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTlogDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTlogDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "loglike")

## Model Log Likelihood is -5.462418 and null Log Likelihood is -123.847 .
The model's log likelihood (-5.4624) is far higher than the null model's (-123.847), showing a considerable improvement in capturing the data patterns.
YTevalmodelauc(namea = 'Decision Tree for log', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTlogDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTlogDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "roc")

The curve indicates perfect or nearly perfect classification, meaning the model has attained an extraordinarily high level of accuracy on this task. However, this is not necessarily a positive sign: it may reflect overfitting, a lack of robustness, or data leakage. It is crucial to verify that the model works well on new data and is resilient to changes in input.
YTevalmodelauc(namea = 'Decision Tree for log', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTlogDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTlogDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Predpplot")

The spikes close to 0 and 1 suggest that the model is quite confident in its predictions for the majority of the data points.
YTevalmodelauc(namea = 'Decision Tree for log', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTlogDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTlogDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Prereplot")

A decision tree model is built with features chosen based on the correlation coefficients for odd variables.
YTevalmodelauc(namea = 'Decision Tree for odd cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTODDCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTODDCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "conma")

## Precision: 0.5319149
## Recall: 0.7142857
## Confusion matrix for Decision Tree for odd cc
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 65| 44|
## |TRUE | 20| 50|
Precision of 0.5319 shows that 53.19% of the positive instances predicted are indeed positive instances. Recall of 0.7143 shows that 71.43% of all actual positives are accurately identified by the model.
YTevalmodelauc(namea = 'Decision Tree for odd cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTODDCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTODDCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "loglike")

## Model Log Likelihood is -114.8999 and null Log Likelihood is -123.847 .
Compared to the null model baseline, the model's log likelihood (-114.8999) is somewhat higher, showing a modest improvement in capturing the patterns of the data.
YTevalmodelauc(namea = 'Decision Tree for odd cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTODDCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTODDCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "roc")

The ROC Curve lies above the null model line, demonstrating the model's ability to make predictions. There is room for improvement, as the curve is not particularly close to the top-left corner.
YTevalmodelauc(namea = 'Decision Tree for odd cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTODDCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTODDCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Predpplot")

The spread of the blue line represents a wider range of probabilities for the high subscriber channels, while the spike of the red line to the left indicates a high density of predictions leaning towards the low subscriber channels.
YTevalmodelauc(namea = 'Decision Tree for odd cc', type = 'multivariate', name = 'Decision Tree', predmodelfortrain = predict(YTODDCCDT, newdata=YTtrain)[,pos.label], truefortrain = YTtrain[,target], predmodelforcal = predict(YTODDCCDT, newdata=YTCal)[,pos.label],trueforcal = YTCal[,target], plotname = "Prereplot")

There appears to be a trade-off between precision and recall: recall decreases as precision increases, and vice versa.
The k-Nearest Neighbors (kNN) with k=3 is used for classification. In this situation, k=3 denotes that when classifying a new data point, the model takes into account its three closest neighbors.
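To make the "three closest neighbours" idea concrete, here is a minimal standalone sketch, written in Python for illustration only (the report uses R's knn() from class); the points and labels below are made up.

```python
from collections import Counter

# Illustrative only: the points and labels below are made up.
def knn_predict(train_X, train_y, x, k=3):
    # Squared Euclidean distance from x to every training point.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    # Majority vote among the k nearest neighbours.
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # high
```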
YTknntrainevenCC <- knn(YTtrain[select_variable_EVENcorelation], YTtrain[select_variable_EVENcorelation], YTtrain[,target], k=3, prob=T)
YTknnCalevenCC <- knn(YTtrain[select_variable_EVENcorelation], YTCal[select_variable_EVENcorelation], YTtrain[,target], k=3, prob=T)
YTtrainknnProbevenCC <- attributes(YTknntrainevenCC)$prob
YTcalknnProbevenCC <- attributes(YTknnCalevenCC)$prob

The kNN model is built using the correlation coefficient variables for even variables.
## Precision: 0.5494505
## Recall: 0.5319149
## Confusion matrix for knn mode.
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:--------------|------------:|--------------:|
## |[0,1.77e+07) | 44| 41|
## |[1.77e+07,Inf] | 44| 50|
The kNN model shows a precision of 0.5495 and a recall of 0.5319 when tested on the calibration set. This shows that 53.19% of the actual high subscriber channels are accurately recognized by the model, and that 54.95% of the channels that are classified as high subscriber channels by the model are actually high subscriber channels. Given the values, the model has a moderate capacity to classify positives accurately.
YTKNNcalauc <- YTAUC(YTcalknnProbevenCC, YTCal[,target])
YTknntrainevenCCauc <- YTAUC(YTtrainknnProbevenCC, YTtrain[,target])
KNNauc <- data.frame(Pred = "KNN for even CC", Type = 'multivariate', TrainAuc = YTknntrainevenCCauc, CalAUC = YTKNNcalauc)
Multimodelauc <- unique(rbind(Multimodelauc,KNNauc))
kable(KNNauc)

| Pred | Type | TrainAuc | CalAUC |
|---|---|---|---|
| KNN for even CC | multivariate | 0.4785212 | 0.5118899 |
The Area Under the Curve (AUC) on the training set is 0.4785, which is below the null model's 0.5. The AUC on the calibration set, 0.5119, is only marginally above the null model.
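These AUC values are easiest to interpret through the rank formulation: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. A minimal sketch, written in Python for illustration (the report's YTAUC helper in R is not shown here):

```python
# Illustrative only: AUC as the Mann-Whitney rank statistic.
def auc(scores, labels):
    # Probability that a random positive outscores a random negative (ties count 1/2).
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.7, 0.6, 0.2], [1, 0, 1, 0]))  # 0.75: one of four pos/neg pairs is mis-ranked
```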
knnplot <- function(predmodel, trueV, titleString="ROC plot") {
perf <- performance(prediction(predmodel, trueV), "tpr", "fpr")
YTROknn <- data.frame(
FPR = unlist(perf@x.values),
TPR = unlist(perf@y.values),
threshold = unlist(perf@alpha.values)
)
ggplot(YTROknn, aes(x = FPR, y = TPR)) +
geom_line(color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = "red") +
labs(x = "FPR", y = "TPR", title=titleString) +
theme_minimal()
}
knnplot(YTcalknnProbevenCC , YTCal[,target], titleString="AUC of kNN predictions for even CC")

The ROC plot shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR) and illustrates the performance of the kNN model in further detail. The red dashed line denotes random chance (the null model), whereas the blue curve represents the performance of the kNN model. The model's curve is only slightly above the null model.
knnprreplot <- function(predmodel, trueV, namea) {
YTprec <- performance(prediction(predmodel, trueV), measure="prec")
YTrec <- performance(prediction(predmodel, trueV), measure="rec")
precision <- (YTprec@y.values)[[1]]
YTxprec <- (YTprec@x.values)[[1]]
recall <- (YTrec@y.values)[[1]]
rocFrame <- data.frame(threshold=YTxprec, precision=precision, recall=recall)
ggplot(rocFrame, aes(x=threshold)) +
geom_line(aes(y=precision, color="precision")) +
geom_line(aes(y=recall, color="recall")) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(x = "threshold", y = "percentage", title = paste("precision and recall plot for", namea)) +
theme_minimal()
}
knnprreplot(YTcalknnProbevenCC , YTCal[,target], 'KNN for even cc')

The precision and recall plot for kNN shows how precision and recall balance out at various threshold values. In the kNN model, recall drops sharply and neither precision nor recall achieves very high values, indicating possible room for improvement.
In summary, the kNN model using the correlation coefficients for even variables has limited ability to differentiate the two classes of channels.
YTknntrainlog <- knn(YTtrain[select_variable_log], YTtrain[select_variable_log], YTtrain[,target], k=3, prob=T)
YTknnCal <- knn(YTtrain[select_variable_log], YTCal[select_variable_log], YTtrain[,target], k=3, prob=T)
YTtrainknnProblog <- attributes(YTknntrainlog)$prob
YTcalknnProblog <- attributes(YTknnCal)$prob

The kNN model is built using the log likelihood variables.
## Precision: 1
## Recall: 0.9893617
## Confusion matrix for knn mode.
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:--------------|------------:|--------------:|
## |[0,1.77e+07) | 85| 0|
## |[1.77e+07,Inf] | 1| 93|
The kNN model with log likelihood variables shows a precision of 1, which again suggests overfitting and is not desirable. A recall of 0.9894 indicates that 98.94% of all real positives are accurately identified by the model.
| Pred | Type | TrainAuc | CalAUC |
|---|---|---|---|
| KNN | multivariate | 0.5034015 | 0.5058824 |
The AUC value for the training set (0.5034) and calibration set (0.5059) are nearly equal to 0.5, showing that the model’s ability to distinguish is comparable to the null model.
knnplot <- function(predmodel, trueV, titleString="ROC plot for log") {
perf <- performance(prediction(predmodel, trueV), "tpr", "fpr")
YTROknn <- data.frame(
FPR = unlist(perf@x.values),
TPR = unlist(perf@y.values),
threshold = unlist(perf@alpha.values)
)
ggplot(YTROknn, aes(x = FPR, y = TPR)) +
geom_line(color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = "red") +
labs(x = "FPR", y = "TPR", title=titleString) +
theme_minimal()
}
knnplot(YTcalknnProblog , YTCal[,target], titleString="AUC of kNN predictions for log")

The model's curve, represented by the blue line, is almost the same as the null model, represented by the red line.
knnprreplot <- function(predmodel, trueV, namea) {
YTprec <- performance(prediction(predmodel, trueV), measure="prec")
YTrec <- performance(prediction(predmodel, trueV), measure="rec")
precision <- (YTprec@y.values)[[1]]
YTxprec <- (YTprec@x.values)[[1]]
recall <- (YTrec@y.values)[[1]]
rocFrame <- data.frame(threshold=YTxprec, precision=precision, recall=recall)
ggplot(rocFrame, aes(x=threshold)) +
geom_line(aes(y=precision, color="precision")) +
geom_line(aes(y=recall, color="recall")) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(x = "threshold", y = "percentage", title = paste("precision and recall plot for", namea)) +
theme_minimal()
}
knnprreplot(YTcalknnProblog , YTCal[,target], 'KNN for log')

The precision and recall plot shows a sharp decline in recall as the threshold approaches 1. This implies that the model predicts fewer channels as high subscriber as the threshold increases.
YTknntrainODDcc <- knn(YTtrain[select_variable_ODDcorelation], YTtrain[select_variable_ODDcorelation], YTtrain[,target], k=3, prob=T)
YTknnCalODdcc <- knn(YTtrain[select_variable_ODDcorelation], YTCal[select_variable_ODDcorelation], YTtrain[,target], k=3, prob=T)
YTtrainknnProbodCC <- attributes(YTknntrainODDcc)$prob
YTcalknnProbodCC <- attributes(YTknnCalODdcc)$prob

The kNN model is constructed using the correlation coefficients of odd variables.
## Precision: 0.6292135
## Recall: 0.5957447
## Confusion matrix for knn mode.
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:--------------|------------:|--------------:|
## |[0,1.77e+07) | 52| 33|
## |[1.77e+07,Inf] | 38| 56|
Precision of 0.6292 shows that around 62.92% of predictions for positive instances are true positives. Recall of 0.5957 indicates that 59.57% of all actual positives are accurately identified by the model.
| Pred | Type | TrainAuc | CalAUC |
|---|---|---|---|
| KNN | multivariate | 0.5608858 | 0.5628911 |
The AUC values for the training set (0.5609) and the calibration set (0.5629) show that the model’s ability to distinguish is slightly better compared to null model. However, the ability to distinguish is still limited.
knnplot <- function(predmodel, trueV, titleString="ROC plot") {
perf <- performance(prediction(predmodel, trueV), "tpr", "fpr")
YTROknn <- data.frame(
FPR = unlist(perf@x.values),
TPR = unlist(perf@y.values),
threshold = unlist(perf@alpha.values)
)
ggplot(YTROknn, aes(x = FPR, y = TPR)) +
geom_line(color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = "red") +
labs(x = "FPR", y = "TPR", title=titleString) +
theme_minimal()
}
knnplot(YTcalknnProbodCC , YTCal[,target], titleString="AUC of kNN predictions of odd cc")

The model performs slightly better than the null model, but there is still room for improvement.
knnprreplot <- function(predmodel, trueV, namea) {
YTprec <- performance(prediction(predmodel, trueV), measure="prec")
YTrec <- performance(prediction(predmodel, trueV), measure="rec")
precision <- (YTprec@y.values)[[1]]
YTxprec <- (YTprec@x.values)[[1]]
recall <- (YTrec@y.values)[[1]]
rocFrame <- data.frame(threshold=YTxprec, precision=precision, recall=recall)
ggplot(rocFrame, aes(x=threshold)) +
geom_line(aes(y=precision, color="precision")) +
geom_line(aes(y=recall, color="recall")) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(x = "threshold", y = "percentage", title = paste("precision and recall plot for", namea)) +
theme_minimal()
}
knnprreplot(YTcalknnProbodCC , YTCal[,target], 'KNN of odd cc')

Recall drops dramatically as the threshold rises, indicating that the model becomes more conservative in its positive predictions. Although recall decreases, precision remains relatively high, showing that the model's positive predictions are mostly correct. The dramatic drop in recall with a rising threshold means the model misses many true positives.
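The threshold behaviour seen in all of these precision and recall plots can be made concrete with a small sketch. The scores and labels below are invented for illustration (Python is used here for illustration only; the report's plots are built with ROCR in R).

```python
# Illustrative only: made-up scores and labels.
def precision_recall_at(scores, labels, threshold):
    # Classify as positive when the score reaches the threshold.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for t in (0.2, 0.5, 0.7):
    print(t, precision_recall_at(scores, labels, t))
# Raising the threshold lifts precision from 0.6 to 1.0 while recall falls from 1.0 to 2/3.
```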
| name | model.Log.likelihood | Null.log.likelihood |
|---|---|---|
| logic regression for even cc | -125.890693 | -123.847 |
| logic regression for log | -36.837815 | -123.847 |
| logic regression for Odd cc | -108.238326 | -123.847 |
| Decision Tree for even cc | -123.901109 | -123.847 |
| Decision Tree for log | -5.462418 | -123.847 |
| Decision Tree for odd cc | -114.899852 | -123.847 |
When compared to the Null log likelihood,
Decision Tree for Log has the highest log likelihood value,
-5.4624. This suggests that among the decision tree models, it offers
the best fit to the data. Logistic regression for log also
stands out with a very high log likelihood, indicating that this model
also fits the data rather well.
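The log likelihoods in this table are, in effect, Bernoulli log likelihoods of the predicted class probabilities, with the null model predicting the positive-class base rate for every row. A minimal sketch of that comparison, in Python with made-up predictions (the report computes these in R):

```python
import math

# Illustrative only: tiny made-up labels and predicted probabilities.
def log_likelihood(probs, labels):
    # Sum log(p) for actual positives and log(1 - p) for actual negatives.
    eps = 1e-12  # guard against log(0)
    return sum(math.log(max(eps, p if y else 1 - p)) for p, y in zip(probs, labels))

labels = [1, 1, 0, 0, 0]
preds = [0.9, 0.8, 0.2, 0.1, 0.4]
base_rate = sum(labels) / len(labels)  # the null model predicts 0.4 for every row
null_ll = log_likelihood([base_rate] * len(labels), labels)
model_ll = log_likelihood(preds, labels)
# A model that captures real structure has a log likelihood closer to 0 than the null's.
assert model_ll > null_ll
```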
| Pred | Type | TrainAuc | CalAUC | |
|---|---|---|---|---|
| 10 | pred.rank | univariate | 0.9990738 | 0.9986859 |
| 5 | Decision Tree for log | multivariate | 0.9951613 | 0.9946809 |
| 2 | logic regression for log | multivariate | 0.9999468 | 0.9929287 |
| 3 | logic regression for Odd cc | multivariate | 0.8058448 | 0.7324155 |
| 23 | pred.video.views | univariate | 0.7976419 | 0.7271589 |
| 6 | Decision Tree for odd cc | multivariate | 0.7931598 | 0.6979975 |
| 25 | pred.lowest_monthly_earnings | univariate | 0.6919727 | 0.5984355 |
| 4 | Decision Tree for even cc | multivariate | 0.6818003 | 0.5959950 |
| 27 | pred.highest_monthly_earnings | univariate | 0.6899500 | 0.5897997 |
| 28 | pred.lowest_yearly_earnings | univariate | 0.6920739 | 0.5840426 |
The TrainAUC for Logic regression for log is
0.9999, which is nearly perfect. It also has a very high CalAUC
of 0.9929, which indicates great performance in both training and
testing. With a CalAUC of 0.9947, Decision Tree for Log also shows great performance. However, as the true positives (TP) for both of the models stated above are 0, these two models will not be taken into consideration. As a result, the logic regression for Odd cc model is selected; its TrainAuc is 0.8058 and CalAUC is 0.7324, which is much better than the remaining models.
YTdiffModel <- list(
list(predmodel = predict(YTlogicregressionEVENCC, newdata = YTtrain, type = 'response'), trueV = YTtrain[, target], namea = "LogisticRegression for evencc"),
list(predmodel = predict(YTlogicregressionODDCC, newdata = YTtrain, type = 'response'), trueV = YTtrain[, target], namea = "LogisticRegression for ODDcc"),
list(predmodel = predict(YTEVENCCDT, newdata = YTtrain)[, pos.label], trueV = YTtrain[, target], namea = "DecisionTree for even CC"),
list(predmodel = predict(YTlogDT, newdata = YTtrain)[, pos.label], trueV = YTtrain[, target], namea = "DecisionTree for log"),
list(predmodel = predict(YTODDCCDT, newdata = YTtrain)[, pos.label], trueV = YTtrain[, target], namea = "DecisionTree for odd CC"),
list(predmodel = YTcalknnProbevenCC, trueV = YTCal[, target], namea = "KNN for even cc"),
list(predmodel = YTcalknnProblog, trueV = YTCal[, target], namea = "KNN for log"),
list(predmodel = YTcalknnProbodCC, trueV = YTCal[, target], namea = "KNN for odd cc")
)
YTmodels <- do.call(rbind, lapply(YTdiffModel, as.data.frame))
YTrocG <- list()
for (model_name in unique(YTmodels$namea)) {
model_subset <- YTmodels[YTmodels$namea == model_name, ]
perf <- performance(prediction(model_subset$predmodel, model_subset$trueV), "tpr", "fpr")
YTROCMulti <- data.frame(
FPR = unlist(perf@x.values),
TPR = unlist(perf@y.values),
threshold = unlist(perf@alpha.values),
namea = model_name
)
YTrocG[[model_name]] <- YTROCMulti
}
combined_roc <- do.call(rbind, YTrocG)
ggplot(combined_roc, aes(x = FPR, y = TPR, color = namea)) +
geom_line() +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = "red") +
labs(x = "FPR", y = "TPR", title = "ROC Curves for Multiple Models") +
labs(color = "Model") +
theme_minimal()

In conclusion, although the Decision Tree for log appears to be perfect, it is overfitting and may not predict new data accurately. Hence, Logistic Regression of Correlation Coefficient for Odd Variables outperforms the other models.
As a result, Logistic Regression of Correlation Coefficient for Odd Variables is selected as the final model. The evaluation of the final selected model using the testing set is shown below.
## Precision: 0.5841584
## Recall: 0.8082192
## Confusion matrix for logistic regression fpr for final model
##
##
## | | [0,1.77e+07)| [1.77e+07,Inf]|
## |:-----|------------:|--------------:|
## |FALSE | 88| 42|
## |TRUE | 14| 59|
## [1] 0.7639293
Based on these metrics, 58.42% of the model's positive predictions are accurate, and the model recognizes approximately 80.82% of all true positives.
## Model Log Likelihood is -115.3738 and null Log Likelihood is -140.7064 .
## [1] 0.7639293
The log likelihood value represents how well the model fits the data: the closer it is to 0, the better the fit. The Logistic Regression of Correlation Coefficient for Odd Variables model is better than the null model, as its log likelihood is closer to 0 than the null model's.
## [1] 0.7639293
The model’s classification is obvious from the supplied graph, where the blue curve consistently maintains a large distance above the null model.
## [1] 0.7639293
In terms of model selection, single variable models, multivariate models, logistic regression, decision trees, and kNN models are used. Through the comparison of AUC, confusion matrices, ROC curves, predicted distributions, and precision and recall plots, logistic regression is the most accurate model among those compared. Besides, log likelihood variables, odd variables, and even variables are used for feature selection, and the odd variables provide the most accurate predictions. Hence, logistic regression with the correlation coefficients of odd variables is chosen as the final model for prediction.
The data is reselected and transformed for clustering. Variables such as the unemployment rate and gross tertiary education enrollment are converted from percentages to actual counts based on population, and monthly and yearly earnings are combined into averages. Next, using the select function, redundant columns (non-numerical variables and columns that have already been transformed) are removed. Then, using the filter function, rows with missing values in particular columns are removed. Lastly, column means are used to impute the remaining missing values.
YT_cluster <- read.csv("./youtube_UTF_8.csv", header = TRUE, encoding = "UTF-8")
YTNV_cluster <- YT_cluster %>%
mutate(
allunemployment = (Unemployment.rate / 100) * Population,
unemployment_urban = (Unemployment.rate / 100) * Urban_population,
educationall = (Gross.tertiary.education.enrollment.... / 100) * Population,
educationurban = (Gross.tertiary.education.enrollment.... / 100) * Urban_population,
averagemonthearning = (lowest_yearly_earnings+ highest_yearly_earnings) / 2,
averageyearearning = (lowest_monthly_earnings+ highest_monthly_earnings) / 2
) %>%
select(-highest_monthly_earnings, -lowest_monthly_earnings, -lowest_yearly_earnings, -highest_yearly_earnings, -Gross.tertiary.education.enrollment...., -Unemployment.rate, -video_views_rank, -country_rank, -channel_type_rank, -created_year, -created_date) %>%
filter(rowSums(is.na(select(., (ncol(.)-9):(ncol(.)-2)))) == 0) %>%
select_if(is.numeric) %>%
mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

Descriptive statistics for the new data, such as minimum, median, mean, standard deviation, and maximum, are computed for data checking.
| Min | Med | Mean | Max | SD | |
|---|---|---|---|---|---|
| rank | 1.0000 | 5.115000e+02 | 5.052385e+02 | 9.950000e+02 | 2.873427e+02 |
| subscribers | 12300000.0000 | 1.755000e+07 | 2.285573e+07 | 2.450000e+08 | 1.784798e+07 |
| video.views | 0.0000 | 7.712618e+09 | 1.123485e+10 | 2.280000e+11 | 1.479528e+10 |
| uploads | 0.0000 | 8.755000e+02 | 1.040791e+04 | 3.013080e+05 | 3.631451e+04 |
| video_views_for_the_last_30_days | 1.0000 | 7.697800e+07 | 1.823042e+08 | 6.589000e+09 | 4.220909e+08 |
| subscribers_for_last_30_days | 1.0000 | 3.519155e+05 | 3.519155e+05 | 8.000000e+06 | 5.061499e+05 |
| Population | 202506.0000 | 3.282395e+08 | 4.303873e+08 | 1.397715e+09 | 4.727947e+08 |
| Urban_population | 35588.0000 | 2.706630e+08 | 2.242150e+08 | 8.429340e+08 | 1.546874e+08 |
| Latitude | -38.4161 | 3.709024e+01 | 2.663278e+01 | 6.192411e+01 | 2.056053e+01 |
| Longitude | -172.1046 | -5.192528e+01 | -1.412815e+01 | 1.382529e+02 | 8.476081e+01 |
| allunemployment | 16929.5016 | 4.825121e+07 | 3.491125e+07 | 7.323999e+07 | 2.670902e+07 |
| unemployment_urban | 2975.1568 | 2.524729e+07 | 2.187965e+07 | 3.978747e+07 | 1.593357e+07 |
| educationall | 15390.4560 | 2.895073e+08 | 2.031631e+08 | 7.072438e+08 | 1.403330e+08 |
| educationurban | 2704.6880 | 1.323599e+08 | 1.305702e+08 | 4.265246e+08 | 9.038184e+07 |
| averagemonthearning | 0.0000 | 1.594175e+06 | 4.002513e+06 | 8.680000e+07 | 7.439542e+06 |
| averageyearearning | 0.0000 | 1.334000e+05 | 3.333274e+05 | 7.225450e+06 | 6.194923e+05 |
Standardize the data so that each variable has a mean of 0 and a standard deviation of 1. This ensures that no single variable's scale unduly influences cluster formation.
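For reference, the z-score transformation that R's scale() applies per column can be sketched as follows (Python, illustrative only):

```python
# Illustrative only: the z-score transformation that R's scale() applies per column.
def standardize(column):
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5  # sample SD, as in scale()
    return [(x - mean) / sd for x in column]

z = standardize([10, 20, 30, 40])
# Standardized values have mean 0 and sample variance 1.
```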
## rank video.views
## 5.052385e+02 1.123485e+10
## uploads video_views_for_the_last_30_days
## 1.040791e+04 1.823042e+08
## subscribers_for_last_30_days Population
## 3.519155e+05 4.303873e+08
## Urban_population Latitude
## 2.242150e+08 2.663278e+01
## Longitude allunemployment
## -1.412815e+01 3.491125e+07
## unemployment_urban educationall
## 2.187965e+07 2.031631e+08
## educationurban averagemonthearning
## 1.305702e+08 4.002513e+06
## averageyearearning
## 3.333274e+05
## rank video.views
## 2.873427e+02 1.479528e+10
## uploads video_views_for_the_last_30_days
## 3.631451e+04 4.220909e+08
## subscribers_for_last_30_days Population
## 5.061499e+05 4.727947e+08
## Urban_population Latitude
## 1.546874e+08 2.056053e+01
## Longitude allunemployment
## 8.476081e+01 2.670902e+07
## unemployment_urban educationall
## 1.593357e+07 1.403330e+08
## educationurban averagemonthearning
## 9.038184e+07 7.439542e+06
## averageyearearning
## 6.194923e+05
hc <- hclust(dist(scale(YTNV_cluster[,colnames(YTNV_cluster)[-2]]), method = "euclidean"), method = "ward.D2")
hcplot <- as.dendrogram(hc)
plot(hcplot, ylab = "Height", leaflab = "none", horiz = FALSE)
hightnu <- c(61.5, 55.4, 33.7)
coline <- c("royalblue2", "royalblue3", "royalblue4")
for (i in 1:length(hightnu)) {
abline(h = hightnu[i], col = coline[i], lty = 8, lwd = 1)
}
text(3, 20, paste("k >=", 5), col = "blue", cex = 0.6)
text(3, c(59, 48), paste("k =", 3:4), col = "blue", cex = 0.6)
text(3, 65, paste("k =", 2), col = "blue", cex = 0.6)

A dendrogram illustrates the structure of a dataset by displaying the links between clusters of data points. A new cluster is created when two branches combine. The blue dotted horizontal lines in the dendrogram mark the heights at which clusters form, and the k values show how many clusters exist at each level. Based on the dendrogram above, two major clusters are formed at k = 2, and these are further divided into three and four clusters at k = 3 and k = 4. For k greater than or equal to 5, the dendrogram implies further subdivisions, but they are not explicitly marked. Hence, we need to find a suitable value of k.
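The merging process a dendrogram records can be sketched in a few lines. Note that the report uses Ward linkage (hclust with method = "ward.D2"); the sketch below uses single linkage purely because it is the simplest to write down, and Python is used for illustration only.

```python
# Illustrative only: the report uses Ward linkage (hclust, method = "ward.D2");
# single linkage is sketched here because it is the simplest to write down.
def single_linkage_merges(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))  # d is the dendrogram "height"
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = single_linkage_merges([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)])
# Two merges for three points; each merge happens at a non-decreasing height.
```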
wss <- function(cluster) {
c0 <- colMeans(cluster)
sum(apply(cluster, 1, function(row) sum((row - c0) ^ 2)))
}
# Calculate the total WSS
Totalwss <- function(df, labels) {
sumwss <- 0
for (i in 1:length(unique(labels))) {
sumwss <- sumwss + wss(subset(df, labels == i))
}
sumwss
}
indexch <- function(df, kmax, data) {
npts <- nrow(df)
wss.value <- numeric(kmax)
wss.value[1] <- wss(df)
for (k in 2:kmax) {
labels <- cutree(data, k)
wss.value[k] <- Totalwss(df, labels)
}
B <- (wss(df) - wss.value)/ (0:(kmax - 1))
w <- wss.value / (npts - 1:kmax)
data.frame(k = 1:kmax, CH_index = B/w, WSS = wss.value)
}

The CH index is a benchmark for determining the ideal cluster count in hierarchical clustering. It balances two aspects of clustering: compactness (how close the elements in a cluster are to one another) and separation. Compactness measures how tight a cluster is via the distances between data points and their cluster centre; a smaller value indicates a clearly defined cluster. Separation measures the distances between the cluster centres and the centroid of the entire dataset; a higher value indicates clearly separated clusters. A good clustering therefore has a high CH index, signifying well-separated clusters with little internal variation. The goal is to identify the k value with the highest CH index, which denotes the most distinct and consistent clustering structure.
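The CH index just described can be written down directly: between-cluster dispersion over within-cluster dispersion, each scaled by its degrees of freedom. A minimal sketch, in Python for illustration (the report's indexch() computes the same ratio in R from hclust cut labels):

```python
# Illustrative only: Calinski-Harabasz index, CH = (BSS/(k-1)) / (WSS/(n-k)).
def ch_index(points, labels, k):
    n, d = len(points), len(points[0])
    grand = [sum(p[j] for p in points) / n for j in range(d)]
    bss = wss = 0.0
    for c in set(labels):
        cl = [p for p, l in zip(points, labels) if l == c]
        centre = [sum(p[j] for p in cl) / len(cl) for j in range(d)]
        # Between-cluster dispersion: cluster centres vs the grand centroid.
        bss += len(cl) * sum((centre[j] - grand[j]) ** 2 for j in range(d))
        # Within-cluster dispersion: points vs their own cluster centre.
        wss += sum(sum((p[j] - centre[j]) ** 2 for j in range(d)) for p in cl)
    return (bss / (k - 1)) / (wss / (n - k))

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
tight = ch_index(pts, [1, 1, 2, 2], 2)  # compact, well-separated clusters
loose = ch_index(pts, [1, 2, 1, 2], 2)  # clusters that mix the two groups
assert tight > loose
```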
ch_criterion <- indexch(scaled_df, 10, hc)
grid.arrange(
ggplot(ch_criterion, aes(x = k, y = CH_index)) +
geom_point() +
geom_line(colour = "red") +
scale_x_continuous(breaks = 1:10, labels = 1:10) +
labs(y = "CH index") +
theme_minimal(),
ggplot(ch_criterion, aes(x = k, y = WSS), color = "blue") +
geom_point() + geom_line(colour = "blue") +
scale_x_continuous(breaks = 1:10, labels = 1:10) +
theme_minimal(),
ncol = 2
)

Based on the plot, the CH index shows a clear peak and is maximized at k = 4.
The dendrogram shows the formation of four clusters, with red rectangles drawn around them.
kmch <- kmeansruns(scaled_df, krange=1:10, criterion="ch")
kmasw <- kmeansruns(scaled_df, krange=1:10, criterion="asw")
kmCritframe <- data.frame(k=1:10, ch=kmch$crit,asw=kmasw$crit)
grid.arrange(ggplot(kmCritframe, aes(x=k, y=ch)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="CH index") +
theme_minimal(),
ggplot(kmCritframe, aes(x=k, y=asw)) +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="ASW") +
theme_minimal(),
nrow=1)

The plots above show peaks at particular k values, which point to candidate groupings. According to the CH index plot the best cluster count is k = 3, while according to the ASW plot the optimum is k = 4.
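For intuition about the ASW (average silhouette width) criterion that `kmeansruns` maximizes, the silhouette of each point can be computed by hand on toy 1-D data (illustrative points, not the YouTube dataset):

```r
# Silhouette by hand: a(i) = mean distance to own cluster,
# b(i) = mean distance to the other cluster, s(i) = (b - a) / max(a, b).
pts    <- c(0, 1, 2, 10, 11, 12)
labels <- c(1, 1, 1, 2, 2, 2)
sil <- sapply(seq_along(pts), function(i) {
  same  <- setdiff(which(labels == labels[i]), i)
  other <- which(labels != labels[i])
  a <- mean(abs(pts[i] - pts[same]))
  b <- mean(abs(pts[i] - pts[other]))
  (b - a) / max(a, b)
})
asw <- mean(sil)  # near 1: points sit deep inside well-separated clusters
```

An ASW near 1 (here about 0.87) says every point is much closer to its own cluster than to the neighbouring one, which is why the k maximizing ASW is taken as optimal.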
princ <- prcomp(scaled_df)
project2D <- as.data.frame(predict(princ, newdata = scaled_df)[, 1:2])
convexHullcheck <- function(proj2ddata, groups) {
do.call(rbind, lapply(unique(groups), function(c) {
f <- subset(proj2ddata, cluster == c)
f[chull(f), ]
}))
}
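`convexHullcheck` relies on base R's `chull()`, which returns the row indices of the points forming the convex hull; a minimal sanity check on a hypothetical unit square plus an interior point:

```r
# chull() returns the indices of the hull points, in clockwise order.
# Hypothetical data: four corners of a unit square plus its centre.
sq <- data.frame(x = c(0, 1, 1, 0, 0.5), y = c(0, 0, 1, 1, 0.5))
hull_idx <- chull(sq)
# Row 5 (the centre) is interior, so it does not appear in hull_idx.
```

Subsetting each cluster's projected points by `chull(f)`, as the function above does, therefore keeps exactly the boundary points needed to draw the cluster polygon.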
plot_clusters <- function(number) {
groups <- cutree(hc, number)
convexHulldf <<- cbind(
project2D,
cluster = as.factor(groups),
subscriber = YTNV_cluster$subscribers
)
convex_hull <<- convexHullcheck(convexHulldf, groups)
ggplot(convexHulldf, aes(x = PC1, y = PC2)) +
geom_point(aes(shape = cluster, color = cluster), alpha = 0.1) +
geom_polygon(data = convex_hull, aes(group = cluster, fill = cluster), alpha = 0.4, linetype = 0) +
labs(title = sprintf("k = %d", number)) +
theme_minimal() +
theme(legend.position = "none")
}
# Generate plots for k=3 and k=4
plots <- lapply(c(3,4), plot_clusters)
# Display plots side by side
grid.arrange(plots[[1]], plots[[2]], ncol = 2)

The data is divided into three and four clusters to decide which number of clusters is more suitable.
It can be observed that when k = 3 the clusters are clearly separated from one another, whereas when k = 4 the clusters overlap.
At this point, the dataset is clustered using the k-means algorithm, and the profiles of Cluster 1 and Cluster 2 are characterized from the traits of each cluster after clustering.
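A minimal sketch of this k-means step on synthetic data (hypothetical points, not the report's `scaled_df`), showing how cluster sizes like those reported here are obtained with `table()`:

```r
# k-means sketch on synthetic, well-separated blobs.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),   # 20 points near (0, 0)
             matrix(rnorm(40, mean = 10), ncol = 2))   # 20 points near (10, 10)
km <- kmeans(toy, centers = 2, nstart = 10)
sizes <- sort(as.integer(table(km$cluster)))           # points per cluster
```

Using `nstart = 10` restarts k-means from several random initializations and keeps the best solution, which guards against a poor local optimum.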
| clusters | Freq |
|---|---|
| 1 | 482 |
| 2 | 390 |
## rank video.views uploads video_views_for_the_last_30_days
## 1 -0.06342256 0.0784036 0.0006477713 0.02517707
## 2 0.07838378 -0.0968988 -0.0008005789 -0.03111628
## subscribers_for_last_30_days Population Urban_population Latitude
## 1 0.01698760 0.5539946 0.7594241 0.2288399
## 2 -0.02099493 -0.6846805 -0.9385703 -0.2828227
## Longitude allunemployment unemployment_urban educationall educationurban
## 1 -0.2393459 0.8264962 0.8053987 0.8560584 0.7907666
## 2 0.2958070 -1.0214646 -0.9953902 -1.0580003 -0.9773064
## averagemonthearning averageyearearning
## 1 0.05044994 0.05064612
## 2 -0.06235096 -0.06259341
## [1] 482 390
Considering the prior analysis, the 872 observations are divided into two clusters: Cluster 1 contains 482 observations and Cluster 2 contains 390. The data points in Cluster 1 have relatively higher values for variables such as video.views, uploads, Urban_population, and the education measures, whereas Cluster 2 reflects data points with below-average values for the majority of the variables.
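Because the features were standardized with `scale()` before clustering, each per-cluster mean above reads as "standard deviations from the overall average", so positive means are above-average groups. A tiny hypothetical illustration:

```r
# With standardized features, group means show direction relative to the
# overall mean: negative = below average, positive = above average.
x <- c(1, 2, 3, 100, 101, 102)
z <- as.numeric(scale(x))          # centre to mean 0, scale to sd 1
g <- c(1, 1, 1, 2, 2, 2)
group_means <- tapply(z, g, mean)  # group 1 negative, group 2 positive
```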
## Length Class Mode
## result 7 hclust list
## noise 1 -none- logical
## nc 1 -none- numeric
## clusterlist 2 -none- list
## partition 872 -none- numeric
## clustermethod 1 -none- character
## nccl 1 -none- numeric
## [1] 1 1
Both Cluster 1 and Cluster 2 are very stable, with bootstrap stability values of 1.
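The stability score reported by `clusterboot` (from the fpc package) is the mean Jaccard similarity between each original cluster and its best-matching cluster across bootstrap resamples, so a value of 1 means the cluster is recovered essentially unchanged every time. A hand-rolled Jaccard similarity for intuition:

```r
# Jaccard similarity of two index sets: |intersection| / |union|.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
j <- jaccard(1:10, 3:12)  # 8 shared indices out of 12 distinct -> 2/3
```

Values above roughly 0.75 are conventionally read as stable clusters, which makes the observed scores of 1 a strong result.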
In a nutshell, the clustering analysis has given a deeper comprehension of the underlying structures in the YouTube dataset.
For hierarchical clustering, the testing indicates that four clusters provide the optimal partition, which corresponds to the CH index plot above used to determine the ideal number of clusters. This means the data points are densely packed within their respective clusters and distinctly separated from the other groups.
From the perspective of k-means clustering, k = 2 is preferred, as it shows a clearer separation with no overlapping.
Finally, since the k-means cluster plot is clearer and better separated than the hierarchical one, and the clusters can be distinguished unambiguously, we choose k-means clustering.